TA-DRD: A Three-step Automatic Duplicate Record Detection
Authors
Abstract
Duplicate record detection is a key step in Deep Web data integration, but existing approaches do not adapt to its large scale. This paper proposes a three-step automatic approach to duplicate record detection in the Deep Web. First, it uses a cluster ensemble to select the initial training instances. Second, it uses tri-training classification to construct a classification model. Finally, it uses evidence theory to combine the results of multiple classification models into a domain-level duplicate record detection model, which can be used for large-scale duplicate record detection within the same domain. Experimental results show that the proposed approach outperforms previous work and that the domain-level duplicate record detection model achieves high performance.
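The third step, combining several classifiers with evidence theory, can be illustrated with a small sketch. The abstract does not say which combination rule or mass assignment the authors use, so the code below assumes Dempster's rule of combination over the two hypotheses "duplicate" and "non-duplicate". The names `classifier_mass` and `combine`, and the `reliability` discount, are hypothetical and chosen only for illustration.

```python
from itertools import product

# Frame of discernment for one record pair (labels are illustrative).
DUP = frozenset({"dup"})
NON = frozenset({"non"})
EITHER = DUP | NON  # mass placed here expresses ignorance


def classifier_mass(p_dup, reliability):
    """Turn one classifier's duplicate probability into a mass function.

    Assumption: a classifier with reliability r commits r of its belief
    to its prediction and leaves 1 - r on the whole frame.
    """
    return {
        DUP: reliability * p_dup,
        NON: reliability * (1.0 - p_dup),
        EITHER: 1.0 - reliability,
    }


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            combined[overlap] = combined.get(overlap, 0.0) + wa * wb
        else:
            # Mass that falls on the empty set is treated as conflict.
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        raise ValueError("evidence is totally conflicting")
    # Renormalise so the remaining masses sum to one.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


if __name__ == "__main__":
    m = combine(classifier_mass(0.9, 0.8), classifier_mass(0.7, 0.5))
    for hypothesis, mass in m.items():
        print(sorted(hypothesis), round(mass, 4))
```

Because Dempster's rule is associative, more than two classifiers can be combined by folding `combine` over their mass functions. A record pair would then be labelled a duplicate when the combined mass on `DUP` exceeds the mass on `NON`; that decision rule is also an assumption, not something the abstract states.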
Similar resources
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Accuracy and validity of data are prerequisites for the proper operation of any software system. Errors can always occur in data because of human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There must be one of them in a data source, but for some reasons like aggregation of ...
Duplicate detection algorithms of bibliographic descriptions
Purpose – The purpose of this paper is to focus on duplicate record detection algorithms used for detection in bibliographic databases. Design/methodology/approach – Individual algorithms, their application process for duplicate detection and their results are described based on available literature (published articles), information found at various library web sites and follow-up e-mail commun...
Automatic Detection of Microaneurysms in Color Fundus Images using a Local Radon Transform Method
Introduction: Diabetic retinopathy (DR) is one of the most serious and most frequent eye diseases in the world and the most common cause of blindness in adults between 20 and 60 years of age. Following 15 years of diabetes, about 2% of the diabetic patients are blind and 10% suffer from vision impairment due to DR complications. This paper addresses the automatic detection of microaneurysms (MA...
Record Matching Over Query Results Using Fuzzy Ontological Document Clustering
Record matching is an essential step in duplicate detection because it identifies records that represent the same real-world entity. Supervised record matching methods require users to provide training data and therefore cannot be applied to web databases, where query results are generated on the fly. To overcome this problem, a new record matching method named Unsupervised Duplicate Elimination (UDE) is p...
A Comparative Study in Classification Techniques for Unsupervised Record Linkage Model
Problem statement: Record linkage is a technique used to detect and match duplicate records that are generated in the data integration process. A variety of record linkage algorithms with different steps have been developed to detect such duplicate records. To find out whether two records are duplicates, supervised and unsupervised classification techniques are utilized in ...